{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 203 - Hyperparameter Tuning with MMLSpark\n",
    "\n",
    "We can do distributed randomized grid search hyperparameter tuning with MMLSpark.\n",
    "\n",
    "First, we import the packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import mmlspark\n",
    "from pyspark.sql.types import IntegerType, StringType, FloatType, StructType, StructField"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's read the data and split it to tuning and test sets:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataFilePath = \"BreastCancer.csv\"\n",
    "textSchema = StructType([StructField(\"Label\", IntegerType(), False),\n",
    "                         StructField(\"Clump Thickness\", IntegerType(), False),\n",
    "                         StructField(\"Uniformity of Cell Size\", IntegerType(), False),\n",
    "                         StructField(\"Uniformity of Cell Shape\", IntegerType(), False),\n",
    "                         StructField(\"Marginal Adhesion\", IntegerType(), False),\n",
    "                         StructField(\"Single Epithelial Cell Size\", IntegerType(), False),\n",
    "                         StructField(\"Bare Nuclei\", FloatType(), False),\n",
    "                         StructField(\"Bland Chromatin\", IntegerType(), False),\n",
    "                         StructField(\"Normal Nucleoli\", IntegerType(), False),\n",
    "                         StructField(\"Mitoses\", IntegerType(), False),])\n",
    "import os, urllib\n",
    "if not os.path.isfile(dataFilePath):\n",
    "    urllib.request.urlretrieve(\"https://mmlspark.azureedge.net/datasets/\" + dataFilePath, dataFilePath)\n",
    "data = spark.createDataFrame(pd.read_csv(dataFilePath, sep=\",\", header=0, na_values=\"?\"), textSchema)\n",
    "tune, test = data.randomSplit([0.80, 0.20])\n",
    "tune.limit(10).toPandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, define the models that wil be tuned:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from mmlspark import TuneHyperparameters\n",
    "from mmlspark.TrainClassifier import TrainClassifier\n",
    "from pyspark.ml.classification import LogisticRegression, RandomForestClassifier, GBTClassifier\n",
    "logReg = LogisticRegression()\n",
    "randForest = RandomForestClassifier()\n",
    "gbt = GBTClassifier()\n",
    "smlmodels = [logReg, randForest, gbt]\n",
    "mmlmodels = [TrainClassifier(model=model, labelCol=\"Label\") for model in smlmodels]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can specify the hyperparameters using the HyperparamBuilder.\n",
    "We can add either DiscreteHyperParam or RangeHyperParam hyperparameters.\n",
    "TuneHyperparameters will randomly choose values from a uniform distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from mmlspark import HyperparamBuilder\n",
    "from mmlspark import RangeHyperParam\n",
    "from mmlspark import DiscreteHyperParam\n",
    "from mmlspark import RandomSpace\n",
    "paramBuilder = \\\n",
    "  HyperparamBuilder() \\\n",
    "    .addHyperparam(logReg, logReg.regParam, RangeHyperParam(0.1, 0.3)) \\\n",
    "    .addHyperparam(randForest, randForest.numTrees, DiscreteHyperParam([5,10])) \\\n",
    "    .addHyperparam(randForest, randForest.maxDepth, DiscreteHyperParam([3,5])) \\\n",
    "    .addHyperparam(gbt, gbt.maxBins, RangeHyperParam(8,16)) \\\n",
    "    .addHyperparam(gbt, gbt.maxDepth, DiscreteHyperParam([3,5]))\n",
    "searchSpace = paramBuilder.build()\n",
    "# The search space is a list of params to tuples of estimator and hyperparam\n",
    "print(searchSpace)\n",
    "randomSpace = RandomSpace(searchSpace)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, run TuneHyperparameters to get the best model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bestModel = TuneHyperparameters(\n",
    "              evaluationMetric=\"accuracy\", models=mmlmodels, numFolds=2,\n",
    "              numRuns=len(mmlmodels) * 2, parallelism=1,\n",
    "              paramSpace=randomSpace.space(), seed=0).fit(tune)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can view the best model's parameters and retrieve the underlying best model pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(bestModel.getBestModelInfo())\n",
    "print(bestModel.getBestModel())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can score against the test set and view metrics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from mmlspark import ComputeModelStatistics\n",
    "prediction = bestModel.transform(test)\n",
    "metrics = ComputeModelStatistics().transform(prediction)\n",
    "metrics.limit(10).toPandas()"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
